memory bottleneck
Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting
Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length L, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only O(L(log L)^2) memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.
Memory-efficient Patch-based Inference for Tiny Deep Learning
Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult.
T-SAR: A Full-Stack Co-design for CPU-Only Ternary LLM Inference via In-Place SIMD ALU Reorganization
Oh, Hyunwoo, Nam, KyungIn, Bhattacharjya, Rajat, Chen, Hanning, Das, Tamoghno, Yun, Sanggeon, Jang, Suyeon, Ding, Andrew, Dutt, Nikil, Imani, Mohsen
Recent advances in LLMs have outpaced the computational and memory capacities of edge platforms that primarily employ CPUs, thereby challenging efficient and scalable deployment. While ternary quantization enables significant resource savings, existing CPU solutions rely heavily on memory-based lookup tables (LUTs) which limit scalability, and FPGA or GPU accelerators remain impractical for edge use. This paper presents T-SAR, the first framework to achieve scalable ternary LLM inference on CPUs by repurposing the SIMD register file for dynamic, in-register LUT generation with minimal hardware modifications. T-SAR eliminates memory bottlenecks and maximizes data-level parallelism, delivering 5.6-24.5x and 1.1-86.2x improvements in GEMM latency and GEMV throughput, respectively, with only 3.2% power and 1.4% area overheads in SIMD units. T-SAR achieves up to 2.5-4.9x the energy efficiency of an NVIDIA Jetson AGX Orin, establishing a practical approach for efficient LLM inference on edge platforms.
- Europe > Austria > Vienna (0.14)
- Europe > France > Auvergne-Rhône-Alpes > Lyon > Lyon (0.05)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (7 more...)
ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification
He, Yefei, Chen, Feng, Liu, Jing, Shao, Wenqi, Zhou, Hong, Zhang, Kaipeng, Zhuang, Bohan
The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase, particularly in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention maps within LVLMs. This sparsity can be leveraged to accelerate attention computation or compress the KV cache through various approaches. However, most studies focus on addressing only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity concerning distinct layers or tasks. In this paper, we present ZipVL, an efficient inference framework designed for LVLMs through a dynamic ratio allocation strategy of important tokens. This ratio is adaptively determined based on the layer-specific distribution of attention scores, rather than fixed hyper-parameters, thereby improving efficiency for less complex tasks while maintaining high performance for more challenging ones. Then we select important tokens based on their normalized attention scores and perform sparse attention mechanism solely on those important tokens, reducing the latency in the prefill phase. Tokens deemed less important will be discarded to reduce KV cache size, alleviating the memory bottleneck in the decoding phase. Our experiments demonstrate that ZipVL can accelerate the prefill phase by 2.3$\times$ and improve decoding throughput by 2.8$\times$, with a minimal accuracy reduction of only 0.5\% on VQAv2 benchmark over LLaVA-Next-13B model, effectively enhancing the generation efficiency of LVLMs.
- Asia > China > Shanghai > Shanghai (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Oceania > Australia > South Australia > Adelaide (0.04)
Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting
Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot- product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length L, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only O(L(log L) 2) memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget.
Memory-efficient Patch-based Inference for Tiny Deep Learning
Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose receptive field redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead.
TinyML is bringing deep learning models to microcontrollers
This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence. Deep learning models owe their initial success to large servers with large amounts of memory and clusters of GPUs. The promises of deep learning gave rise to an entire industry of cloud computing services for deep neural networks. Consequently, very large neural networks running on virtually unlimited cloud resources became very popular, especially among wealthy tech companies that can foot the bill. But at the same time, recent years have also seen a reverse trend, a concerted effort to create machine learning models for edge devices.
Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting
Li, Shiyang, Jin, Xiaoyong, Xuan, Yao, Zhou, Xiyou, Chen, Wenhu, Wang, Yu-Xiang, Yan, Xifeng
Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot- product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length L, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only O(L(log L) 2) memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget.
3D Electronic Nose Demostrates Advantages of Carbon Nanotubes
You'd think computers spend most of their time and energy doing, well, computation. But that's not the case: about 90 percent of a computer's execution time and electrical energy is spent transferring data between the processor and the memory banks, says Subhasish Mitra, a computer scientist at Stanford University. Even if Moore's law continued on indefinitely, computers would still be limited by this memory bottleneck. This week in the journal Nature, Mitra and collaborators describe a new computer architecture they say addresses this problem--and that Mitra believes will improve both the energy efficiency and speed of computers by a factor of 1000. The new 3D architecture is based on novel devices including 2 million carbon nanotube transistors and over 1 million resistive RAM cells, all built on top of a layer of silicon using existing fabrication methods and connected by densely packed metal wiring between the layers. As a demonstration, the team built an electronic nose that can sense and identify several common vapors including lemon juice, rubbing alcohol, vodka, wine, and beer.
- Materials (0.70)
- Semiconductors & Electronics (0.70)
3D Electronic Nose Demostrates Advantages of Carbon Nanotubes
You'd think computers spend most of their time and energy doing, well, computation. But that's not the case: about 90 percent of a computer's execution time and electrical energy is spent transferring data between the processor and the memory banks, says Subhasish Mitra, a computer scientist at Stanford University. Even if Moore's law continued on indefinitely, computers would still be limited by this memory bottleneck. This week in the journal Nature, Mitra and collaborators describe a new computer architecture they say addresses this problem--and that Mitra believes will improve both the energy efficiency and speed of computers by a factor of 1000. The new 3D architecture is based on novel devices including 2 million carbon nanotube transistors and over 1 million resistive RAM cells, all built on top of a layer of silicon using existing fabrication methods and connected by densely packed metal wiring between the layers.
- Materials (0.72)
- Semiconductors & Electronics (0.51)